A fast approach for overcomplete sparse 
decomposition based on smoothed i'^ norm 

Hosein Mohimani^, Massoud Babaie-Zadeh^* Member and Christian Jutten^ Fellow 



Abstract 

In this paper, a fast algorithm for overcomplete sparse decomposition, called SLO, is proposed. The 
algorithm is essentially a method for obtaining sparse solutions of underdetermined systems of linear 
equations, and its applications include underdetermined Sparse Component Analysis (SCA), atomic de- 
composition on overcomplete dictionaries, compressed sensing, and decoding real field codes. Contrary to 
previous methods, which usually solve this problem by minimizing the £^ norm using Linear Programming 
(LP) techniques, our algorithm tries to directly minimize the norm. It is experimentally shown that the 
proposed algorithm is about two to three orders of magnitude faster than the state-of-the-art interior-point 
LP solvers, while providing the same (or better) accuracy. 
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A fast approach for overcomplete sparse 
decomposition based on smoothed £^ norm 



I. Introduction 

Finding sparse solutions of Under-determined Systems of Linear Equations (USLE) is of significant 
importance in signal processing and statistics. It is used, for example, in underdetermined Sparse Com- 
ponent Analysis (SCA) and source separation [1], [2], [3], [4], atomic decomposition on overcomplete 
dictionaries [5], [6], compressed sensing [7], [8], decoding real field codes [9], image deconvolution [10], 
[11], image denoising [12], electromagnetic imaging and Direction of Arrival (DOA) finding [13]. Despite 
recent theoretical developments [14], [15], [16], [17], the computational cost of the methods has remained 
as the main restriction, especially for large systems (large number of unknowns/equations). In this article, 
a new approach is proposed which provides a considerable reduction in complexity. To introduce the 
problem in more details, we will use the context of Sparse Component Analysis (SCA). The discussions, 
however, may be easily followed in other contexts and applications. 

SCA can be viewed as a method to achieve separation of sparse sources. Suppose that m source 
signals are recorded by a set of n sensors, each of which records a combination of all sources. In linear 
instantaneous (noiseless) model, it is assumed that x(t) = As{t) in which x(t) and s(t) are the n x 1 
and m X 1 vectors of source and recorded signals, respectively, and A is the n x m (unknown) mixing 
matrix. The goal of Bhnd Source Separation (BSS) [18], [19] is then to find s{t) only by observing x(t). 
The general BSS problem is impossible for the case m> n. However, if the sources are sparse (i.e., not 
a totally blind situation), then the problem can be solved in two steps [1], [2]: first estimating the mixing 
matrix, and then estimating the sources assuming A being known. For sparse sources, the first step - 
which can become very tricky for large m - may be accomphshed by means of clustering [1], [2], [20], 
[21]. The second step requires that for each sample (to) the sparse solution of the USLE x(to) = As(to) 
be found [1], [2], [22], [23]. Note also that the sparsity of the sources is not necessarily in the time 
domain: if T{.} is a linear 'sparsifying' transformation, then T{x} = AT{s}. Due to linearity of T, 
both the linearity of the mixing and the statistical independence properties of sources are preserved in 
the transformed domain. Hence, SCA may be apphed in the transformed domain. 

In the atomic decomposition viewpoint [5], the signal vector x = [x{l), . . . ,x{n)]^ is composed of 
the samples of a 'single' signal x{t), and the objective is to represent it as a linear combination of m. 



September 16, 2008 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. XX, NO. Y, MONTH 2007 



2 



n X 1 signal vectors {a^}^^. After [24], the vectors ai, 1 < i < m are called atoms and they collectively 
form a dictionary over which the signal is to be decomposed. We may write x = ^i^i = -^^^ 

where A = [ai, . . . , a^] is the n x m dictionary (matrix) and s = (si, . . . , Sm)^ is the m x 1 vector of 
coefficients. A dictionary with m > n is called overcomplete. Although, m = n (e.g. Discrete Fourier 
Transform) is sufficient to obtain such a decomposition, using overcomplete dictionaries has a lot of 
advantages in many diverse apphcations (refer for example to [6] and the references in it). In all these 
appUcations, we would hke to use as small as possible number of atoms to represent the signal. Again, 
we have the problem of finding sparse solutions of the USLE As = x. 

To obtain the sparsest solution of As = x, we may search for a solution with minimal £° norm, 
i.e., minimum number of nonzero components. It is usually stated in the literature [6], [9], [4] that 
searching the minimum £^ norm is an intractable problem as the dimension increases (because it requires 
a combinatorial search), and it is too sensitive to noise (because any small amount of noise completely 
changes the £^ norm of a vector). Consequently, researchers consider other approaches. One of the most 
successful approaches is Basis Pursuit (BP) [5], [15], [4], [25] which finds the minimum £^ norm (that is, 
the solution of As = x for which J2i is minimized). Such a solution can be easily found by Linear 
Programming (LP) methods. The idea of Basis Pursuit is based on the observation that for large systems of 
equations, the minimum norm solution is also the minimum f norm solution [14], [15], [5]. By using 
fast LP algorithms, specifically interior-point LP solvers, large-scale problems with thousands of sources 
and mixtures become tractable. However, it is still very slow, and in the recent years several authors have 
proposed improvements for BP, to speed up the algorithm and to handle the noisy case [16], [6], [10], 
[II]. Another family of algorithms is Iterative Re-weighted Least Squares (IRLS), with FOCUSS [13] 
as an important member. These are faster than BP, but their estimation quality is worse, especially if the 
number of non-zero elements of the sparsest solution is not very small. Another approach is Matching 
Pursuit (MP) [24], [26], [I] which is very fast, but is a greedy algorithm and does not provide good 
estimation of the sources. The approach presented in [27] is also very fast, but adjusting its parameters 
is not easy. 

Contrary to previous approaches, the method we present in this paper is based on direct minimization 
of the £^ norm. We will see that our method performs typically two to three orders of magnitude faster 
than BP (based on interior-point LP solvers), while resulting in the same or better accuracy. We have 
already briefly reported the basics of this approach in [28] and its complex version in [29]. However, in 
this paper, we are going to present a highly more complete description of this approach and consider, 
mathematically and/or experimentally, its convergence properties and the effects of its parameters. 
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The paper is organized as follows. The next section introduces the basic principles of our approach. 
The final algorithm is then stated in Section |llll In Section JVl convergence properties of the algorithm 
is discussed. Finally, Section |V] provides some experimental results of our algorithm and its comparison 
with BP. 

II. Basic Principles Of Our Approach 

A. The Main Idea 

The problems of using norm (that is, the need for a combinatorial search for its minimization, and its 
too high sensibility to noise) are both due to the fact that the £^ norm of a vector is a discontinuous function 
of that vector. Our idea is then to approximate this discontinuous function by a suitable continuous one, 
and minimize it by means of a minimization algorithm for continuous functions (e.g. steepest descent 
method). The continuous function which approximates ||s||o, the £^ norm of s, should have a parameter 
(say a) which determines the quality of the approximation. 

For example, consider the (one-variable) family of functions: 

Us)^exp{-sy2a^), (1) 



and note that: 



or approximately: 



Then, by defining: 



lim /^(s) 

cr^O 



Us) 



1 ; if s = 

(2) 

; if s / 



1 ;if Isl < o- 

■ (3) 

:if Isl > 0- 



F.{s) = Y,Usi), (4) 

1=1 

it is clear from and (|3]l that ||s||o ~ m — Fcr(s) for small values of a, and the approximation tends to 
equality when cr ^ 0. Consequently, we can find the minimum ^''-norm solution by maximizing -Fcr(s) 
(subject to As = x) for a very small value of a. Note that the value of a determines how smooth the 
function is: the larger value of a, the smoother (but worse approximation to £'^-norm); and the 
smaller value of a, the closer behavior of Fa- to ^'^-norm. 

Note that for small values of a, F^ is highly non-smooth, and contains a lot of local maxima, and 
hence its maximization is not easy. On the other hand, for larger values of a, F^ is smoother and contains 
less local maxima, and its maximization is easier (we will see in the next subsection that there is no 
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local maxima for large enough cr's). Consequently, our idea is to use a 'decreasing' sequence for a: for 
maximizing for each value of a (using e.g. gradient algorithms), the initial value of the maximization 
algorithm is the maximizer of F^r for the previous (larger) value of a. If we gradually decrease the 
value of a, for each value of a the maximization algorithm starts with an initial solution near to the 
actual maximizer of (this is because a and hence have only slightly changed and consequently 
the maximum of the new F^ is eventually near to the maximum of the previous F^), and hence we 
eventually escape from getting trapped into local maxima and reach to the actual maximum for small 
values of a, which gives the minimum ^''-norm solutior^. 

Note that the basic idea holds not only for Gaussian family of functions given in ([T|), but also for 
any family of functions f^j which approximates the Kronecker delta function, i.e., satisfies Q and Q. 
For example, it also holds for the family of 'triangular' functions: 



Us) 



(O- + S)/(T 
{(T-S)/(T 

and for the family of 'truncated hyperbolic' functions: 



1 



if |s| > a 

if -o- < s < , (5) 
if < s < o- 



,1 ;if Isl > cr 

Us) = { , (6) 

l-(s/cj)2 ;if|s|<o- 

and also for the family of functions: 

f,{s) = a^/{s^ + a^). (7) 

B. Initialization 

Up to now, the behavior of the function fa- was discussed for small values of a. It is also interesting 
to consider the behavior for very large values of a. 

More specifically, it can be shown that "for sufficiently large values of a, the maximizer ofF„{s) subject 
to As = X is the minimum £^-norm solution of As = x, i.e., the solution given by the pseudo-inverse 
of A". Here, we give only a justification to this property for the case of Gaussian family of functions 
introduced in ([T]) by using Lagrange multipliers, and we leave the formal proof to Section IIV-BI 

Using the method of Lagrange multipliers, for maximizing Fcr(s) = Yl^i fo-is-i) = Z^I^Li exp (— s^/2(T^) 
subject to As = x, we set the derivative of the Lagrangian £(s, A) = Fct(s) — A^(As — x) with respect to 

'This technique for optimizing a non-convex function is usually called Graduated Non-Convexity (GNC) [30]. 
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s and A equal to zero, which gives the following Karush-Kuhn-Tucker (KKT) system of m + n nonlinear 
equations of m + n unknowns (m components of s, and n components of A): 

[sie-^?/2-^ , . . . , Sme-'^^/^^T - A^Ai = 

(8) 

As - X = 

V 

where Ai = — u^A. 

On the other hand, the minimum ("^ norm solution of As = x may be found by minimizing ^s^s 
subject to As = x. Using Lagrange multipUers, this minimization results in the system of equations: 

A^A = 

As - X = 

Comparing systems ([8]l and we see that for o" — > oo (or where a ^> maxjsi, . . . , Sm}), these two 
systems of equations are identical, and hence the maximizer of Fcr(s) is the minimum ^^-norm solution 
of As = X. 

III. The Final Algorithm 

The final algorithm, which we call SLO (Smoothed £^), is obtained by applying the main idea of the 
previous section on the Gaussian family ([T]), and is given in Fig. [T] 

Remark 1. The internal loop (steepest ascent for a fixed a) is repeated a fixed and small number of 
times (L). In other words, for increasing the speed, we do not wait for the (internal loop of the) steepest 
ascent algorithm to converge. This may be justified by the gradual decrease in the value of a, and the 
fact that for each value of a, we do not need the exact maximizer of F^. We just need to enter the region 
near the (global) maximizer of Fa- for escaping from its local maximizers. See also Remarks 3 to 5 of 
Section HV-Al 

Remark 2. Steepest ascent consists of iterations of the form s <— s + fijVF^{s). Here, the step-size 
parameters fij should be decreasing, i.e., for smaller values of a, smaller values of fij should be applied. 
This is because for smaller values of a, the function is more 'fluctuating', and hence smaller step- 
sizes should be used for its maximization. In fact, we may think about changing the value of o" in ([T]) 
and ^ as looking at the same curve (or surface) at different 'scales', where the scale is proportional to 
a. For having equal (i.e., proportional) steps of the steepest ascent algorithm in these different scales, 
it is not difficult to showa that /ij should be proportional to u^. Note that in Fig. [H instead of /Xj, 

^To see this, suppose that si = rai in F^-^ corresponds to S2 = ra2 in Fcr2- Then /ii V-Fo-j (si)//i2 VFo-j (52) = oxja^ 
results in /ii//i2 ~ (j\la\. 
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• Initialization: 

1) Let §0 be equal to the minimum norm solution of As = x, 
obtained by pseudo-inverse of A. 

2) Choose a suitable decreasing sequence for u, [cti . . . oj] (see 
Remarks 5 and 6 of the text). 

. For j = 1,..., J: 

1) Let G — ai. 

2) Maximize (approximately) the function Fa on the feasible 

set iS = {s I As — x} using L iterations of the steepest 
ascent algorithm (followed by projection onto the feasible 
set): 

- Initialization: s = Sj_i. 

- For (. — 1 . . . L (loop L times): 

a) Let S = [si exp (— Si/2(7^), . . . , s„ exp {~s'^/2a^)]'^ . 

b) Let s ^ s — fiS (where is a small positive constant). 

c) Project s back onto the feasible set 5: 

|_ s^s- A^(AA^)"^(As-x). 

3) Set Sj = s. 

• Final answer is s = §,/. 

Fig. 1. The final SLO algorithm. 

only a constant jj. is appeared. The reason is that by letting fij = jj.a'^ for some constant /i, we have 
s ^ s + {fia'^)VF^ = s - fiS, where S = -a'^VF„ = [si exp (-sf/2(j^), . . . , Snexp (-s^/2cr^)]^. 

Remark 3. According to the algorithm, each iteration consists of an ascent step Sj <— Sj — 
/iSj exp(— s?/2cj^), 1 < i < m, followed by a projection step. If for some values of i we have \si\ » cr, 
then the algorithm does not change the value of Sj in that ascent step; however it might be changed in 
the projection step. If we are looking for a suitable large (to reduce the required number of iterations), 
a suitable choice is to make the algorithm to force all those values of s-i satisfying | s j | < a toward zero. 
For this aim, we should have /iexp(— s^/2ct^) 1, and because exp(— s?/2cr^) < 1 for \si\ < a, the 
choice > 1 seems reasonable. 

Remark 4. The algorithm may work by initializing sq (the initial estimation of the sparse solution) 
to an arbitrary solution of As = x. However, the discussion of Section III-BI shows that the best initial 
value of §0 is the minimum £^ norm solution of As = x, which corresponds to a ^ oo. In another point 
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of view, one may think about the minimum i'^ norm solution as a rough estimate of the sparse solution, 
which will be modified in the future iterations of the algorithm. In fact, calculating minimum £^ norm 
is one of the earliest approaches used for estimating the sparsest solution and is called the Method Of 
Frames (MOF) [5]. 

Remark 5. Having initiated the algorithm with the minimum £^ norm solution (which corresponds to 
a = oo), the next value for a (i.e., ui) may be chosen about two to four times of the maximum absolute 
value of the obtained sources (maxj \ si\). To see the reason, if we take for example a > 4maxj \ si\, then 
exp(-s2/2cr2) > 0.96 w 1 for all 1 < i < m, and comparison with Q shows that this value of a acts 
virtually like infinity for all the values of Sj, 1 < i < m (the next remark, too, provides another reason 
through another viewpoint to the algorithm). 

For the next values of a, we have used aj = caj-i, j > 2, where c is usually chosen between 0.5 
and 1. Its effect is experimentally studied in Section IVl). 

Remark 6. Equation ^ seems to simply count the "inactive" elements of s. However, instead of 
hard-thresholding "inactive = \si\ < a; active = |sj| > a", criterion ^ uses a soft-thresholding, for 
which a is the rough threshold. 

Remark 7. In applications where the zeros in the sparsest s are exactly zero, a can be decreased 
arbitrarily. In fact, in this case, its minimum value is determined by the desired accuracy, as will be 
discussed in Theorem [T] For applications in which inactive elements of s are small but not exactly 
zero (say that the 'source' vector is noisy), the smallest a should be about one to two times of (a rough 
estimation of) the standard deviation of this noise. This is because, while a is in this range, ([3]) shows that 
the cost function treats small (noisy) samples as zeros (i.e., for which fa{si) ~ 1). However, below this 
range, the algorithm tries to Team' these noisy values, and moves away from the true answer (according 
to the previous remark, the soft threshold should be such that all these noisy samples be considered 
inactive). Restricting cjj to be above the standard deviation of the noise, provides the robustness of this 
approach to noisy sources (or mixtures), which was one of the difficulties in using the exact norm. 

IV. Theoretical Analysis of the Algorithm 

A. Convergence Analysis 

In this section, we try to answer two questions for the noiseless case (the noisy case will be considered 
in Section HV-CI ): a) Does the basic idea of Section |ll] results in convergence to the actual minimizer of 
the 1^ norm (assumed to be unique by [13], [15])? and b) If yes, how much should we decrease a to 
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achieve a desired accuracy? 

Note that the algorithm of Fig. [T] has two loops: the external loop which corresponds to the basic 
ideas of Section |II] for finding the sparsest solution, and the internal loop which is a simple steepest 
ascent algorithm for maximizing ^^(s) for a fixed a. In the analysis of this section, it is assumed that 
the maximization of ^^(s) has been exactly done for a fixed a (the maximization algorithm has not got 
trapped into local maxima). Note that we had proposed the gradual decrease in a to escape from getting 
trapped into local maxima when maximizing for a fixed a. A theoretical study to find the series aj, 
j = I, ... J, which guaranties the convergence is very tricky (if possible) and is not considered in this 
paper. However, it will be experimentally addressed in the next section. 

Assuming the maximization of Fa- for fixed u's is perfectly done, we show here that the estimation 
given by the algorithm converges to the unique minimizer of the norm. In other words, we prove that 
the sequence of 'global' maximizers of F^'s will converge to the sparsest solution (which is the basic 
idea of Section JIl), and try to answer both above questions. 

Before stating the convergence theorem (Theorem [T]), we state three lemmas. Recall that null(A) = 
{s|As = 0}. 

Lemma 1: Assume that the matrix A = [ai,a2,--- ,Shn] G ]K"x™ (where represents the i-th 
column) has the property that all of its n x n sub-matrices are invertible, which is called the Unique 
Representation Property (URP) in [isj^. If m — n elements of s G null(A) converge to zero, then all of 
its elements (and hence s) will converge to zero, too. 

Proof: Without loss of generality, assume that all the columns of A are normalized, i.e. ||aj|| = 1, 
1 < i < m (throughout the paper, || • || stands for the l"^ or Euclidean or Frobenius norm of a vector or 
matrix). Then, we have to show: 

V/3 > 0, 3a > 0, such that Vs G null(A) : 

m — n elements of s have absolute values (10) 
less than a =^ ||s|| < /? 

Let s = ( ■Si; •S2, • • • , Sm) be in null(A) and assume that the absolute values of at least m — n elements 
of it are smaller than a. Let be the set of all indices i, for which \si\ > a. Consequently, |/q,| < n, 

^URP of A also guaranties that the sparsest solution is unique [13], [15]. 
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(11) 




where \X\ represents the cardinaUty (i.e., number of elements) of a set X. Then we write: 

m 

SiS-i = =^ Siai + ^ Si^i = =^ 

II ^ ^ Sjaj|| = II ^ ^ Sjaj|| < ^ ^ ||sjaj|| = 
«e/„ 

< a = {m — |lQ,|)a < ma 

Let A be the sub-matrix of A containing only those columns of A that are indexed by the elements 
of la- Thus A has at most n columns, and the columns of A are linearly independent, because of the 
URP of A. Therefore, there existcl a left inverse A^^ for A. Let s and s denote those sub-vectors of s 
which are, and which are not indexed by la, respectively. Then: 

J] Si^i =As ^ ||s|| = ||(A"i)(^ Siai)|| 

< ||A~"'"|| • II ^ Sjaj|| < ||A^"'"||(mQ!) 
ll^ll ^ Z]j^/<, 1^*1 < ("^ - \la\)OL < ma 

||s|| < ||A~-'-||ma J (13) 

||s|| < ||s|| + ||s|| < (||A"^|| + l)ma 

Now, let Ai be the set of all submatrices A of A, consisting of at most n columns of A. Then M. is 
clearly a finite set (in fact < 2™). LeS 

M = max{||A-i|| |AgA^}, (14) 

then 

||s|| < (||A"i|| + l)ma <{M + l)ma. (15) 

M is a constant and its value depends only on the matrix A. Therefore, for each f3 it suffices to choose 
a = (3/m{M + l). ■ 
The above proof (calculations (fTTIl to (fTSl l) results also in the following corollary: 

*Not that A is not necessarily a square matrix and hence is not necessarily invertible. But it has a left inverse, which is not 
necessarily unique. In this case A^^ is just 'one' of these inverses. For example, since A is tall and full-rank, its Moore-Penrose 
pseudoinverse is one of these inverses. 

^Note that the calculation of M is difficult in the cases where m and n are large. Calculation of the exact value of M requires 
a computation complexity larger than which can be impractical for large values of m and n. 
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Corollary 1: If A G ^nxm satisfies URP, and s G null(A) has at most n elements with absolute 
values greater than a, then ||s|| < (M + l)ma, where M is as defined in ([T4l ). 

Lemma 2: Let a function /cr(s) have the properties /cr(0) = 1 and Vs,0 < fais) < 1, and let Fcr(s) 
be defined as in Assume A satisfies the URP, and let 5 = {s | As = x}. Assume that there exists a 
(sparse) solution s'^ G 5 for which ||s'^||o = k < n/2 (such a sparse solution is unique [13], [15]). Then, 
if for a solution s = (si, . . . , Sm)^ G S: 

F^{s) >m- {n- k), (16) 
and if a > is chosen such that the Sj's with absolute values greater than a satisfy fa{si) < then: 

||s-s°|| < {M + l)ma, (17) 

where M is as defined in ([T4l) . 

Proof: Let la be the set of all indices i for which \si\ > a, and denote its number of elements by 
\Ia\- Then: 

m 

1=1 




<ni-\U <m-i=l 

Combining this result with ( [T6l ). we obtain: 

m — (n — k) < Fa-{s) < m — \Ia\ + 1 

=^=- \Ia\ < n — k + 1 ^ \Ia\ < n — k. 
Consequently, at most n — k elements of s have absolute values greater than a. Since s'^ has exactly k 
non-zero elements, we conclude that s — s'^ has at most {n — k) + k = n elements with absolute values 
greater than a. Moreover, (s — s'^) G null(A) (because A(s — s'^) = x — x = 0), and hence Corollary \T\ 
implies (ITtI ). ■ 

Corollary 2: For the Gaussian family ([T]), if ([T6l ) holds for a solution s, then: 

||s - s°|| < (Af + l)mcrV21nm. (18) 



Proof: For Gaussian family ([T|l, the a of the above lemma can be chosen as a = o"V2 In m, because 
for \si\ > (T\/21nm: 



m 
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Moreover, this family satisfies the other conditions of the lemma. ■ 

Lemma 3: Let fa-, F^, S and s*^ be as in Lemma |2j and let s'^ be the maximizer of Fa{s) on S. Then 
s'^ satisfies ( fT6l ). 

Proof: We write: 

Ffj{s'^) > Fcr{s^) (because s*^ is the maximizer) 

> m — k (see below) (19) 

> m — {n — k) (because k < ^). 

The second inequality was written because s*^ has m — k zeros, and hence in the summation ^ there 
we. m — k ones, and the other terms are non-negative. ■ 

Note that Lemma[3]and Corollary |2]prove together that for the Gaussian family ([T]), argmax^g^^ -^(t(s) - 
s*^ as fj ^ 0. This result can, however, be stated for a larger class of functions /^r, as done in the following 
Theorem. 

Theorem 1: Consider a family of univariate functions indexed by o", cr G M^, satisfying the set of 
conditions: 

1) lim^^o/a(s) = ;foralls/0 

2) /^(O) = 1 ; for all cj e M+ 

3) < fa{s) < 1 ; for all a G M+, s G M 

4) For each positive values of v and a, there exists ctq € M"*" that satisfies: 

\s\> fa{s) < V ; for all a < ao- (20) 

Assume A satisfies the URP, and let S, F^ and s*^ be as defined in Lemma |2j and s*^ = (sf^, . . . , s^)^ 
be the maximizer of Fa{s) on S. Then: 

lim s'^ = s°. (21) 
Proof: To prove (|2T]) . we have to show that: 

V/3>0 3^0 > 0, Vct<cto US'" - s°|| < /3. (22) 

For each P, let a = (3 /m{M + \), where M is as defined in ([T4l ). Then for this a and v = condition|4] 
of the theorem gives a ctq for which (l20l ) holds. We show that this is the do we were seeking for in (l22l) . 
Note that Vo" < fio, (l20l ) states that for 's with absolute values greater than a we have /^(sf ) < 
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Moreover, Lemma [3] states that s*^ satisfies ([T6l ). Consequently, all the conditions of Lemma |2] have been 
satisfied, and hence it implies that Hs*^ — ^^\\ < {M + l)ma = 13. ■ 

Remark 1. The Gaussian family ([T|l satisfies conditions 1 through 4 of Theorem [T] In fact, conditions 
1, 2 and 3 are obvious. To see condition 4, it is sufficient to choose = — a^/(21ni^) if < 1, or to 
choose any arbitrary cJq G if i/ > 1. Families of functions defined by (ID, ^ and ([7]) also satisfy the 
conditions of this theorem. 

Remark 2. Using Corollary |2l where using Gaussian family ([1]), to ensure an arbitrary accuracy j3 
in estimation of the sparse solution s'^, it suffices to choose: 

a < , , 

m\/2 In m(M + 1) 

and do the optimization of F^r subject to As = x. 

Remark 3. Consider the set of solutions s'^ in S, which might not be the absolute maxima of 
functions F„ on S, but satisfy the condition 

F„{s'')>m- {n-k). (23) 

By following a similar approach to the proof of Theorem [T] it can be proved that linicr^o ^" = s'^. In other 
words, for the steepest ascent of the internal loop, it is not necessary to reach the absolute maximum. It 
is just required to achieve a solution in which F^ is large enough (see also Remark 1 of Section Hill ). 

Remark 4. The previous remark proposes another version of SLO in which there is no need to set 
a parameter L: Repeat the internal loop of Fig. [T] until Fcr(s) exceeds m — n/2 (the worst case of the 
limit given by (|23] )) ox m — {n — k) ii k is known a priori (note that (fT9l ) implies the maximizer of 
F(j(s) for a fixed a surely exceeds both of these limits). The advantage of such a version is that if it 
converges, then it is guaranteed that the estimation error is bounded as in dTSl ). in which a is replaced 
with (Tj. It has however two disadvantages: firstly, it slows down the algorithm because exceeding the 
limit m — {n — k) for each a is not necessary (it is just sufficient); and secondly, because of the possibility 
that the algorithm runs into an infinite loop because Ffj{s) cannot exceed this limit (this occurs if the 
chosen sequence of a has not been resulted in escaping from local maxima). 

Remark 5. As another consequence. Lemma [T] provides an upper bound on the estimation error 
||s — s'^ll, only by having an estimation s (which satisfies As = x): Begin by sorting the elements of s 
in descending order and let a be the absolute value of the ([^J + l)'th element. Since s*^ has at most 
n/2 non-zero elements, s — s*^ has at most n elements with absolute values greater than a. Moreover, 
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(s — s*^) G null(A) and hence Corollary [T] implies that ||s — s*^!! < (M + l)ma, where M is as defined in 
([T4l) . This result is consistent with the heuristic "if s has at most n/2 Targe' components, the uniqueness 
of the sparsest solution insures that s is close to the true solution". 

B. Relation to minimum norm 2 solution 

In section III-BI it was stated and informally justified (for the Gaussian family ([T|l) that for very large 
it's, the maximizer of the function subject to As = x is the minimum £^-norm solution of As = x. 
This result can be more accurately proved, and also generalized to a wider class of functions: 

Theorem 2: Consider a family of one variable functions fa{-), parameterized by o" G M+, satisfying 
the set of conditions: 



1) All functions fa are scaled versions of some analytical function /, that is, fa{s) = f{s/a) 

2) Vs G M, < f{s) < 1 

3) f{s) = l^s = 

4) /'(O) = 

5) /"(O) < 

Assume that the matrix A is full-rank and let s = argmin^g^^ ll^ll = A^(AA^) ^x be the minimum 
£^-norm solution of the USLE As = x. Then: 



Proof: Let s"" = (sj, . . . , s^)^ = argmaxy^g^^ Fo-(s). Then, we have to show that limo— ^oo s"^ = 



lim argmaxF(j(s) = s. 

f^-^oo As=x 



S = 



T 



First we show that: 



lim — = 0. 



(24) 



Since s"^ is the maximizer of F^j, we have: 



(25) 



and hence: 



m 




= m 



m 




(26) 



i=l 
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On the other hand, assumption 2 implies that for all 1 < i < m, < limcr^oo /(sf /o") < 1. Combining 
this with (l26l ). we have: 

lim /(sf/cj) = 1 ;forl<i<m. (27) 

This result, combined with assumption 3 (that is, = 0) and the continuity of / implies that for 

all 1 < i < m, limCT_»oo sf/c" = 0; from which (l24l) is deducted. 
Now, let 7 = ^/"(O) > 0. Then we can write 

f{s) = l-js^ + g(s), 

where: 

lim ^ = 0. (28) 

Then: 

m m 

F^(s) = m - ^^s^ + ^5f(si/cj). 

^ i=l i=l 

Consequently, ( [25] ) can be written as: 



mm mm 

i=l i=l i=l i=l 

2 m 2 

^ 1=1 i=i 



where for the last inequality, we have used the inequality: 

I xiVji < E'^^iEi^ji- 

i€l,j€J iei jeJ 
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Finally: 



Is'^f < ll§ll'- 



lim Si/a = ^ lim ^j^'/*^] = (from (|28]l), 
lim /fj = ^ lim f^T^ = (from ([28])), 

^ lim lls'^f < ||sf . (29) 

Noting that s is the minimum £^-norm solution of As = x, \\s'^\\'^ > ||s|p, and hence limo-^oo ll^'^lP ^ 
||s|p. Combining this with ( |29l l. we have: 

lim lls^lp = ||sf. (30) 

On the other hand, since s is the minimum ^^-norm solution of As = x, it is perpendicular to any vector 
contained in null(A). This is because Vv G null(A), Av = 0, and hence v-^s = v^A-^(AA-^)^^x = 
(Av)^(AA^)^^x = 0. Consequently, s is perpendicular to s"^ — s. Therefore: 

||S-||2 = ||s||2 + ||s--sf 

=^ lim lls'^lp = ||s||^ + lim ||s'^ - s|p. 

a — ^oo cr — ^oo 

Combining this with (l30l ) we have limcr^oo lls"^ — s|p = 0, and hence liuia-^oo s"^ = s. ■ 

Remark 1. The Gaussian family (dJ satisfies the conditions 1 through 5 of Theorem [2] Therefore, for 
this family of functions, the minimum ^^-norm solution is the optimal initialization. Family of functions 
defined by ^ also satisfies the conditions of this theorem, contrary to those defined in ^ and Q which 
are not analytic. 

C. The noisy case 

As shown in the proof of Theorem [1] in the noiseless case, a smaller value of a results in a more 
accurate solution and it is possible to achieve solutions as accurate as desired by choosing small enough 
values of a. However, this is not the case in the presence of additive noisq3> that is, if x = As + n. In 

*The 'noise' in tiiis context lias two meanings: 1) the noise in the source vector s means that the inactive elements of s are 
not exactly equal to zero; and 2) the (additive) noise in the sensors means that x is not exactly equal to As. In the theorems 
of this section, only the second type of noise has been considered, and it is assumed that the first type does not exist. In other 
words, the inactive elements of s are assumed to be exactly zero. 
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fact, noise power bounds maximum achievable accuracy. We state a theorem in this section, which can 
be considered as an extension of Theorem [T] to the noisy case. 

First, we state the following lemma, which can be considered as a generalization to Lemma [T] 

Lemma 4: Let A satisfy the conditions of Lemma [T] and assume that the vector s has m — n elements 
with absolute values less than a, and ||As|| < e. Then ||s|| < /3, where 



and M is as defined in (fT4l) . 

Note that in this lemma, instead of condition As = 0, we have a relaxed condition || As|| < e. Lemma[T] 
is the special (noiseless) case of this lemma where e ^ 0. 



Theorem 3: Let = {s| ||As — x|| < e}, where e is an arbitrary positive number, and assume that 
the matrix A and functions f^^ satisfy the conditions of Theorem [T] Let G 5e be a sparse solution, 
and assume that satisfies the extra conditions: 

1) There exists 7 > such that 



/? = (M + l)(ma + e) 



Proof: Let la, A, s, s and M be defined as in the proof of Lemma [T] Then 



m 






— /o-(s)| < j/a ; for all cr > and all s 



2) For each positive values of u and ctq, there exists an a > that satisfies: 



s\ > a ^ fa{s) < V ; for all a < uo 




(31) 



(M + l)(ma + e) 
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where a is the value for which the condition 2 holds for oq and v = 1/m. 

Proof: Let n = As'^ — x. Then, s*^ G means that ||n|| < e. By defining n = A^(AA^)~^n, we 
have: 

X = As° + n = As° + An = A(s° + n) = As, 

where s = s'^ + n. Let s'^ be the maximizei0 of on As = x, as defined in Theorem [T] When working 
with £'^-norm, no matter how much small is e and how much sparse is s'^, s is not necessarily sparse. 
However, as will be discussed, because is continuous and ||n|| is small, the value of at s is close 
to its value at s'^ (and thus, is large). In fact: 

F,(s) = F,(sO + n) ~ F,(sO) + VF,(s°) • n. 

By defining g{t) = F„{s^ + fit), we have g{ff) = F<^(s°) and ^(l) = F^(s° + n) = F^(s). Using the 
mean value theorem, there exists a < t < 1 such that: 

|F.(s) - F.(sO)| = \g{l) - 5(0)1 < (1 - ^)g'{t) 

= VF^(s° + nt) • n < ||VF^(s° + fit)|| • ||n|| 



We write: 



Vs \4-Ja{s)\ < 7/cT ||VF,(sO + nt)|| < m^/a 



n 



A^ (AA^ )-%|| < ||A^ (AA 



T\-l| 



la 



(s) > m - (n - k) 



|F„(s) - F,(sO)| < m7e||A^(AA^)~i| 

Let choose ctq according to (|3T] ). Then: 

|F,„(s)-F,„(sO)| <n-2k 
i^ao(s°) >m-k 

The vector does not necessarily satisfy As = x, however we have chosen s to be the projection of 
s'^ onto the subspace As = x. Hence, s satisfies As = x and since s*^" is the maximizer of Ffj^ on 
As = X, Fcro(s'^") > m — {n — k). Consequently, by choosing a as the value for which the condition 
2 holds for v = 1/m and ctq, and following the same steps as in the proof of Theorem [T] we conclude 
that at most n — k elements of s"^" can have absolute values greater than a. Then, since s*^ has at most 
k nonzero elements, (s'' — s"^") has at most n elements with absolute values greater than a. Noticing 



|A(sO 



I As'' — x|| < e, we see that (s'^ — s*^") satisfies the conditions of Lemma |4l and hence: 

||s°-s'^"|| < (M + l)(ma + e). (32) 



^Note that, s"^ is not necessarily maximizer of Fa on the whole 5e 
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Remark 1. A few calculations show that the Gaussian families (O satisfies the condition 1 of the 
theorem for 7 = exp(— 1/2) and the condition 2 for a = — (7oa/2 ln(i/). Family of functions defined by 
^ also satisfy the conditions of this theorem. 

Remark 2. Note that for Gaussian family of functions and under the condition k < n/2, accuracy 
of the solution is proportional to the noise powei^. In fact, we have accuracy of at least C ■ e, where: 



If e ^ 0, by choosing o"o according to (|3TI ). s'^" converges to s^. 

Remark 3. According to Theorem [3l in contrast to the noiseless case, it is not possible here to 
achieve arbitrarily accurate solutions. Accuracy is bounded by the noise power, and to guaranty an error 
estimation less than /3 using Theorem [3l it is required to satisfy e < (3/C. 



In this section, the performance of the presented approach is experimentally verified and is compared 
with BP (and with FOCUSS for the first experiment). The effects of the parameters, sparsity, noise, and 
dimension on the performance are also experimentally discussed. 

In all of the experiments (except in Experiment 3), sparse sources are artificially created using a 
Bernoulli-Gaussian model: each source is 'active' with probability p, and is 'inactive' with probability 
1 — p. If it is active, each sample is a zero-mean Gaussian random variable with variance a'^^; if it is 
not active, each sample is a zero-mean Gaussian random variable with variance cr^g, where cj^g <C a'^^. 
Consequently, each Sj is distributed as: 



where p denotes the probability of activity of the sources, and sparsity implies that p 1. (Jofr models 
the noise in the sources, that is, small values of the sparse sources in their inactive case. This parameter 
is mostly meaningful in SCA applications, in which, usually the sources in their inactive states are not 
exactly zero. However, in sparse decomposition applications fioff can be usually set to zero, that is, most 
elements of the dictionary are absent in the decomposition. 

In our simulations, (Ton is always fixed to 1. The effect of fioff is investigated only in the first experiment. 
In all the other experiments it is set to zero. 

^Optimal choice of ao is also proportional to the noise power. 
September 16, 2008 DRAFT 




V. Experimental Results 



Si ~ p • 7^(0, cTon) + {l-p)- AA(0, O-off) 
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itr. # 


a 


MSB 




1 


1 


/I Q/1 o 

4.o4 e — 




z 


L.oL 




U.J 




o 

z 


^ 1 Q 

J. 17 






4.96 e - 


3 




4 


0.1 


2.30 e - 


3 


16.44 


5 


0.05 


5.83 e - 


4 


20.69 


6 


0.02 


1.17e- 


4 


28.62 


7 


0.01 


5.53 e- 


5 


30.85 


algorithm 


total time (sec) 


MSE 




SNR (dB) 


SLO 


0.227 


5.53 e - 


5 


30.85 


LP (^i-magic) 


30.1 


2.31 e- 


4 


25.65 


FOCUSS 


20.6 


6.45 e- 


4 


20.93 



Each column of the mixing matrix is randomly generated using the normal distribution and then is 
normalized to unity. Then, the mixtures are generated using the noisy model: 

X = As + n, (34) 

where n is an additive white Gaussian noise (modeling sensor noise, or decomposition inaccuracy) with 
CO variance matrix cJnln (where I„ stands for the n x n identity matrix). 

To evaluate the estimation quality, Signal-to-Noise Ratio (SNR) and Mean Square Error (MSE) are 
used. SNR (in dB) is defined as 201og(||s||/||s — s||) and MSE as (l/m)||s — s|p, where s and s denote 
the actual source and its estimation, respectively. 

Using (l33l) . the number of active sources has a binomial distribution with average mp. In the experi- 
ments, we will use the parameter k = mp, instead of p. 

Experiment 1. Performance analysis 

In this experiment, we study the computational cost of the presented method, and compare its perfor- 
mance with £i-magic [25] as one of the fastest implementations of interior-point LP, and with FOCUSSy. 
In rest of the paper, by LP we mean £i -magic implementation of the interior point LP 

The values used for the first part of the experiment are m = 1000, n = 400, p = 0.1, a^^ = 0, 
(Ton = 1, cTn = 0.01 and the sequence of a is fixed to [1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01]. /i is fixed to 2.5. 

'For FOCUSS, we have used the MATLAB code available at |http ://dsp.ucsd.edu/~j fmurray/ software ■ htni| 



September 16, 2008 



DRAFT 



ffiEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. XX, NO. Y, MONTH 2007 



20 




'0 200 400 600 800 1000 



Fig. 2. Evolution of SLO toward the solution: m = 1000, n = 400 and k = 100(p = 0.1). From top to bottom, first plot 
corresponds to the actual source, second plot is its estimation at the first level (cr = 1), third plot is its estimation at the second 
level (a = 0.5), while the last plot is its estimation at third level (a = 0.2). 

For each value of a the gradient-projection loop (the internal loop) is repeated three times, i.e., L = 3 
(influence of L is discussed in part of experiment 2; in all other experiments /i and L are fixed to 2.5 
and 3). 

We use the CPU time as a measure of complexity. Although it is not an exact measure, it gives a rough 
estimation of the complexity, for comparing SLO and LP algorithms. Our simulations are performed in 
MATLAB7 environment using an AMD Athlon sempron 2400+, 1.67GHz processor with 512MB of 
memory, and under Microsoft Windows XP operating system. 

Table U shows the gradual improvement in the output SNR after each iteration, for a typical run of 
SLO. Moreover, for this run, the total time and final SNR have been shown for SLO, for LP, and for 
FOCUS S. It is seen that SLO performs two orders of magnitude faster than LP, while it produces a better 
SNR (in some applications, it can be even three orders of magnitudes faster: see Experiment 6). Figure |2] 
shows the actual source and it's estimations at different iterations for this run of SLO. 

The experiment was then repeated 100 times (with the same parameters, but for different randomly 
generated sources and mixing matrices) and the values of SNR (in dB) obtained over these simulations 
were averaged. These averaged SNR's for SLO, LP, and FOCUSS were respectively 30.85dB, 26.70dB, 
and 20.44dB; with respective standard deviations 2.36dB, 1.74dB and 5.69dB. The minimum values of 
SNR for these methods were respectively 16.30dB, 18.37dB, and 10.82dB. Among the 100 runs of the 
algorithm, the number of experiments for which SNR>20dB was 99 for SLO and LP, but only 49 for 
FOCUSS. 
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In the second part of the experiment, we use the same parameters as in the first part, except = 0.01 
to model the noise of the sources in addition to AWG noise modeled by o"„. The averaged SNR's for 
SLO, LP, and FOCUSS were respectively 25.93dB, 22.15dB and 18.24; with respective standard deviations 
1.19dB, 1.23dB and 3.94dB. 

Experiment 2. Dependence on the parameters 

In this experiment, we study the dependence of the performance of SLO to its parameters. The sequence 
of a is always chosen as a decreasing geometrical sequence aj = caj-i, I < j < J, which is determined 
by the first and last elements, ai and aj, and the scale factor c. Therefore, when considering the effect 
of the sequence of a, it suffices to discuss the effect of these three parameters on the performance. 
Reasonable choice of ai, and also approximate choice of /i have already been discussed in Remarks 2 
to 5 of Section |inl Consequently, we are mainly considering the effects of other parameters. 

The general model of the sources and the mixing system, given by ( [33l ) and (|34l ). has four essential 
parameters: (Ton> Oj^p, an, and p. We can control the degree of source sparsity and the power of the 
noise by changing^ k = mp and (T„. We examine the performance of SLO and its dependence to these 
parameters for different levels of noise and sparsity. In this and in the foUowings, except Experiment 6, 
all the simulations are repeated 100 times with different randomly generated sources and mixing matrices 
and the values of the SNR's (in dB) obtained over these simulations are averaged. 

Figures [3] represents the averaged SNR (as the measure of performance) versus the scale factor c, for 
different values of A; = mp and (T„. It is clear from Fig. Oa) that SNR increases when c increases form 
zero to one. However, when c exceeds a critical value (0.5 in this case), SNR remains constant and does 
not increase anymore. 

Generally, the optimal choice of c depends on the application. When SNR is the essential criterion, 
c should be chosen large, resulting in a more slowly decreasing sequence of a, and hence in a higher 
computational cost. Therefore, the choice of c is a trade-off between SNR and computational cost. 
However, as seen in the figures, when c approaches to unity, SNR does not increase infinitely. In Fig. Ha), 
the optimal value of c, i.e. the smallest value of c that achieves the maximum SNR, is approximatively 
c = 0.5. However, it is clear from Fig. |3tb) that the optimal choice of c depends on the sparsity, but not 
on the noise power Exact calculation of the optimal c might be very hard. To guarantee an acceptable 
performance, it suffices to choose c greater than its optimal value. 

'"Note that the sources are generated using the model l |33t . Therefore, for example k = 100 does not necessarily mean that 
exactly 100 sources are active. 
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Fig. 3. Performance of SLO as function of c for the case m = 1000 and n = 400 (SNR's are averaged over 100 runs of the 
algorithm), cri is fixed to 1 (large enough) and aj is fixed to 0.01 (small enough). In (a) k is fixed to 100 and effect of noise 
is investigated. In (b) an is fixed to 0.01 and effect of sparsity factor is analyzed. 

From [15], we know that A; < n/2 is a theoretical limit for sparse decomposition. However, most of 
the current methods cannot approach this limit (see Experiment 3). In Fig. Hb), k = 190 ~ 200 = n/2 
is plotted, and it is clear that by choosing c larger than 0.9 an acceptable performance can be achieved 
(however, with a much higher computational cost). 

In Fig. m SNR is plotted versus — ln((Tj) (where cjj is the last and smallest a) for different values of k 
and (Tn- In Fig. Sfa), for the noiseless case, SNR increases linearly, by increasing in — \\i{aj). Although 
not directly clear from the figure, calculation of the obtained values of the figure better shows this linear 
relationship. This confirms the results of Theorem [T] (accuracy is proportional to the final value of a). In 
the noisy case, SNR increases first, and then remains constant. As was predicted by Theorem [3l in the 
noisy case the accuracy is bounded and might not be increased arbitrarily. 

Generally, the optimal choice of cjj depends on the application. In applications in which SNR is highly 
more important than the computational load, aj should be chosen small, resulting in a larger sequence 
of a, and hence a higher computational cost. However, excessively small choice of oj (smaller than the 
optimal choice) does not improve SNR (in fact SNR is slightly decreased. Recall also the Remark 6 
of Section HID ). It is clear from Fig. |4] that the optimal choice of oj depends on the noise power, but 
not on the sparsity. Exact calculation of the optimal oj might be very hard. To guarantee an acceptable 
performance, it suffices to choose oj less than its optimal value. 

From this experiment it can be concluded that, although finding optimal values of the parameters for 
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Fig. 4. Performance of SLO versus aj for m = 1000 and n — 400 (SNR's are averaged over 100 runs of the algorithm), ai is 
fixed to 1 (large enough) and c is fixed to 0.8 (near enough to one). In (a) k is fixed to 100 and effect of noise is investigated. 
In (b) ffn is fixed to 0.01 and effect of sparsity factor is analyzed. 




L 



Fig. 5. Averaged SNR (on 100 runs of the algorithm) versus L for the case m — 1000 and n = 400, k = 100 and a„ — 0.01 

optimizing the SNR with the least possible computational cost may be very hard, the algorithm is not 
very sensitive to the parameters, and it is not difficult to choose a sequence of a (i.e., c and oj). 

Finally, to study the effect of L (number of iterations of the internal steepest ascent loop), the parameters 
are fixed to the values used at the beginning of Experiment 1, and the averaged SNR (over 100 runs of 
the algorithm) is plotted versus L in Fig. \5\ It is clear from this figure that the final SNR achieves its 
maximum for a small L, and no longer improves by increasing it, while the computation cost is directly 
proportional to L. Hence, as it was said in Remark 1 of Section |ni] and Remark 3 of Section |IV-A[ we 
generally fix L to a small value, say L = 3. 
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Fig. 6. Averaged SNR's (over 100 runs of the algorithm) versus k, the average number of active sources, for SLO algorithm 
with several values of c, and for LP. The parameters are m — 1000, n — 400, cri = 1, oj = 0.01, (7,i = 0.01. 

Experiment 3. Effect of sparsity on the performance 

How much sparse a source vector s should be to make its estimation possible using our algorithm? 
Here, we try to answer this question experimentally. As mentioned before, there is the theoretical limit 
of n/2 on the maximum number of active sources to insure the uniqueness of the sparsest solution. But, 
practically, most algorithms cannot achieve this limit [15], [13]. 

To be able to measure the effect of sparsity, instead of generating the sources according to the model 
(|33] ). we randomly activate exactly k elements out of m elements. Figure [6] then shows the output SNR 
versus k, for several values of c, and compares the results with LP. Note that SLO outperforms LP, 
specially in cases where A; ~ n/2 = 200. 

It is obvious from the figure that all methods work well if k is smaller than a critical value, and they 
start breaking down as soon as k exceeds this critical value. Figure [6] shows that the break-down value 
of k for LP and for SLO with c = 0.5 is approximately 100 (half of the theoretical limit n/2 = 200). 
For c = 0.8 and c = 0.95, this break-down value is approximately 150 and 180. Consequently, with our 
algorithm, it is possible to estimate less sparse sources than with LP algorithm. It seems also that by 
pushing c toward 1, we can push the breaking-down point toward the theoretical limit n/2; however, the 
computational cost might become intolerable, too. 

Experiment 4. Robustness against noise 

In this experiment, the effect of the noise variance, cr„, on the performance is investigated for different 
values of aj and is compared with the performance of LP. Figure |7] depicts SNR versus (T„ for different 
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Fig. 7. Averaged SNR's (over 100 runs of the algorithm) versus the noise power a„ for different values of aj, and for LP. 
The parameters are m = 100, n = 400, k = 100, g\ = 1, and c = 0.8. 

values of oj for both methods. The figure shows the robustness of SLO against small values of noise. In 
the noiseless case {pn < -02), LP performs better (note that (Toff = 0, and in SLO, a is decreased only 
to 0.005). In the noisy case, smoothed-^'^ achieves better SNR. Note that the dependence of the optimal 
aj to a-n is again confirmed by this experiment. 

Experiment 5. Number of sources and sensors 

In this experiment, we investigate the effect of the system scale (i.e., the dimension of the mixing 
matrix, m and n) on the performance and justify the scalability of SLO. 

First, to analyze the effect of the number of mixtures (n), by fixing m to 1000, SNR is plotted versus 
n, for different values of k in Fig. Oa). It is clear from this figure that both methods perform poorly 
while 2k > n (note that the sparsest solution is not necessarily unique in this case). SLO performs better 
as soon as n exceeds 2k (the theoretical limit for the uniqueness of the sparsest solution). 

Then, to analyze the effect of scale, n is fixed to [0.4m], and SNR is plotted versus log(m) for 
different values of k in Fig. (Ub). From this figure it is obvious that SLO and LP perform similarly for 
small values of A; (A; ~ 10), but SLO outperforms LP for larger values of A; (A; ~ 100). 

Experiment 6. Computational Cost in BSS applications 

In BSS and SCA appUcations, the model is written as x(t) = As{t) + n(t), 1 < t < T, where T 
is the number of samples. In matrix form, this can be written as X = AS + N, where X, S, and N are 
respectively n x T, m x T and n x T matrices, where each column stands for a time sample. 

For solving this problem with LP, the system x(t) = As(t) + n(t) should be individually solved for 
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Fig. 8. Effect of scale on performance (SNR's are averaged over 100 runs of the algorithm), an ~ 0.01, c = 0.8, ai — 1, 
aj — 0.01, and SLO is compared with LP. In (a) m is fixed to 1000 and SNR is plotted versus n for different values of k. In 
(b) SNR is plotted versus log(m) for different values of k, while n is fixed to [0.4m]. 

each value of 1 < t < T. This trivial approach can also be used with SLO. However, since all the steps 
of SLO presented in Fig. [J are in matrix form, it can also be directly run on the whole matrices X and 
S. Because of the speed of the current matrix multiplication algorithmic^, this results in an increased 
speed in the total decomposition process. 

Figure |9] shows the average computation time per sample of SLO for a single run of the algorithm, 
as a function of T for the case m = 1000, n = 400 and k = 100. The figure shows that by increasing 
T, average computation time first increases, then decreases and reach to a constant. For T = 1, the 
computation time is 266ms (this is slightly different with the time of the first experiment, 227ms, because 
these are two different runs). However, for T = 10000, the average computation time per sample decreases 
to 38ms. In other words, in average, SLO finds the sparse solution of a linear system of 400 equations 
and 1000 unknowns just in 38ms (compare this with 30s for £i-magic, given in Experiment 1). 

"Let A, s and S be n x T, m x 1 and m x T matrices, respectively. In MATLAB, the time required for the multiplication 
AS is highly less than T times of the time required for the multiplication As. This seems to not be due to the MATLAB's 
interpreter, but a property of Basic Linear Algebra Sub-programs (BLAS). BLAS is a free set of highly optimized routines for 
matrix multiplications, and is used by MATLAB for its basic operations. This property does not exist in MATLAB 5.3 which 
was not based on BLAS. 
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Fig. 9. Average computation time per sample of SLO, as a function of T, number of (time) samples, for the case m — 1000, 
n = 400 and k — 100. cr,i is chosen 0.01 and the sequence of a is fixed to [1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01], the same parameter 
used in first experiment. 

VI. Conclusions 

In this paper, we showed that the smoothed norm can be used for finding sparse solutions of 
an USLE. We showed also that the smoothed version of the £^ norm not only solves the problem of 
intractable computational load of the minimal £^ search, but also results in an algorithm which is highly 
faster than the state-of-the-art algorithms based on minimizing the norm. Moreover, this smoothing 
solves the problem of high sensitivity of £^ norm to noise. In another point of view, the smoothed 
provides a smooth measure of sparsity. 

The basic idea of the paper was justified by both theoretical and experimental analysis of the algorithm. 
In the theoretical part, Theorem [T] shows that SLO is equivalent to ^'^-norm for a large family of functions 
/cr. Theorem |2] gives a strong assessment for using £^-norm solution for initialization. This theorem also 
suggests that the minimal ("^ norm can be seen as a rough estimation of the sparse solution (like Method 
Of Frames), which will be modified in the future iterations. Theorem |3] justifies the robustness of SLO 
against noise. 

Other properties of the algorithm were studied experimentally. In particular, we showed that (1) the 
algorithm is highly faster than the state-of-the-art LP approaches (and it is even more efficient in SCA 
applications), (2) choosing suitable values for its parameters is not difficult, (3) contrary to previously 
known approaches it can work if the number of non-zero components of s is near n/2 (the theoretical 
limit for the uniqueness of the sparse solution), and (4) the algorithm is robust against noise. 
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Up to now, we have no theoretical result for determining how much 'gradual' we should decrease the 
sequence of a, and it remains an open problem for future works. Some open questions related to this issue 
are: Is there any sequence of a which guaranties escaping from local maxima for the Gaussian family 
of functions given in ([T])? If yes, how to find this sequence? If not, what happens with other families 
of functions F^? Moreover, is there any (counter-)example of A, s and x for which we can prove that 
for any sequence a the algorithm will get trapped into a local maximum? These issues, mathematically 
difficult but essential for proving algorithm convergence, are currently investigated. However, Experiment 
2 showed that it is fairly easy to set some parameters to achieve a suitable performance. Moreover, for an 
estimation s of the sparsest source (obtained by any method), we provided in Remark 5 of Section IIV-AI 
an upper bound for the estimation error. 

In addition, future works include better treatment of the noise in the model (|34l ) by taking it directly 
into account in the algorithm (e.g. by adding a penalty term to F^). Moreover, testing the algorithm on 
different applications (such as compressed sensing) using real-world data is under study in our group. 
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